library(tidyverse)
library(rethinking)
library(dagitty)
library(knitr)
library("ggdag")
library("ggrepel")
Epidemiology, 5th edition, Leon Gordis
Causal Inference: What If, by Miguel A. Hernán, James M. Robins
Causal inference in statistics: an overview, Pearl 2009
An introduction to causal inference, Pearl 2010
https://malco.io/2019/09/17/tidy-causal-dags-with-ggdag-0-2-0/
https://cran.r-project.org/web/packages/ggdag/vignettes/intro-to-ggdag.html
https://cran.r-project.org/web/packages/ggdag/vignettes/bias-structures.html
Types of studies
Experimental: clinical trials, intervention trials, prevention trials, field trials.
Observational: cross-sectional studies, cohort studies (retrospective and prospective), case-control studies (including “nested”), matched case-control studies, ecological studies.
Sensitivity and specificity; predictive value positive and predictive value negative; likelihood ratios (binary, ordinal and quantitative tests); comparison of sensitivity and specificity of 2 tests; prevalence/apparent prevalence relationship; sensitivity, specificity and predictive values of tests in series and parallel; kappa for interobserver agreement; ROC curves.
FOR PROJECT:sensitivity /spp of DENV PCR test and of surveillance method. Validation of PCR test How do you define dengue exposure? What’s the PRNT cut-off?
Proportion of the population affected at time t = snapshot of disease.
Units: 0-1 or 0-100%
\[\frac{cases\:at\:time\:t\:(new + existing)}{total\:population\:at\:time\:t}\] Risk of being a case.
\[prevalence = incidence \times duration\] It’s a bad measure of risk because it depends on the duration of disease. Chronic diseases will have high prevalence, and rapidly fatal diseases will have low prevalence, regardless of the incidence.
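A rough numeric sketch of this relationship, assuming a steady-state population and low prevalence (the approximation degrades as prevalence grows). All values and variable names are hypothetical and chosen only for illustration.

```r
# Hypothetical values, for illustration only
incidence_rate   <- 0.01   # new cases per person-year (same for both diseases)
duration_chronic <- 10     # average years a chronic case lasts
duration_fatal   <- 0.1    # average years a rapidly fatal case lasts

# Steady-state approximation: prevalence = incidence * duration
prev_chronic <- incidence_rate * duration_chronic
prev_fatal   <- incidence_rate * duration_fatal

c(chronic = prev_chronic, rapidly_fatal = prev_fatal)
```

With identical incidence, the two prevalences differ only because of duration, which is why prevalence is a poor stand-in for risk.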
\[\frac{cases\:at\:time\:t\:(new + existing)}{total\:population\:at\:time\:t}\]
\[\frac{cases\:observed\:over\:period\:t\:(new + existing)}{total\:population\:at\:midpoint\:of\:period\:t}\]
Proportion of the population at risk of being affected that does become affected during a time period t: cases / (population × time at risk).
\[\frac{new\:cases\:observed\:over\:period\:t}{total\:population\:at\:risk\:during\:period\:t}\]
Risk of becoming a case.
\[\frac{cases}{population*time\:at\:risk}\] It measures risk because it measures events, that is, transitions from the unaffected to the affected state.
The population at risk is a crude measure of the population at risk at the beginning of the time period. It assumes a static population at risk.
Units: 0-1 or 0-100% per time interval.
\[\frac{new\:cases\:observed\:over\:period\:t}{total\:population\:at\:risk\:at\:start\:of\:time\:period\:t}\] Measures average risk. Is apt for short time-periods or static populations.
The population at risk is the sum of all the disease-free/ at risk time periods for each individual. It assumes the risk of each person in the population does not change over time.
Units: 0- \(\infty\) cases/population-time
\[\frac{new\:cases\:observed\:over\:period\:t}{total\:population-time}\] Measures risk by taking into account the time elapsed before disease occurred for each individual, thus it also measures the speed at which disease occurs at a certain timepoint. It is apt for prolonged time periods or dynamic populations.
To calculate population-time:
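A small sketch of the calculation, assuming hypothetical follow-up data in which each person contributes the time they were observed while still disease-free. The data frame and its column names are illustrative, not taken from the source.

```r
# Hypothetical cohort: years each person was followed while at risk,
# and whether follow-up ended with disease onset (1) or not (0)
follow_up <- data.frame(
  years_at_risk = c(2.0, 5.0, 1.5, 5.0, 3.5),
  became_case   = c(1,   0,   1,   0,   0)
)

person_time <- sum(follow_up$years_at_risk)   # total person-years at risk
new_cases   <- sum(follow_up$became_case)

# Incidence rate: cases per person-year
incidence_rate <- new_cases / person_time

# Cumulative incidence for comparison: treats the cohort as closed and
# ignores how long each person was actually followed
cumulative_incidence <- new_cases / nrow(follow_up)
```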
Proportion of the exposed individuals that becomes affected during a time period t.
\[\frac{cases}{exposed}\]
Relationship between incidence and prevalence. Gordis
Speed of death in time t.
Measures risk: good measure when disease is mild, bad measure when disease is very deadly and the case-fatality is high.
\[\frac{deaths\:over\:period\:t}{total\:population\:at\:risk\:during\:time\:period\:t}\]
overall deaths
\[\frac{deaths\:over\:period\:t}{total\:population\:at\:risk\:during\:time\:period\:t}\]
deaths in a specific subgroup (age, sex, diseased with a certain disease)
\[\frac{deaths\:in\:subgroup\:over\:period\:t}{population\:at\:risk\:in\:subgroup\:during\:time\:period\:t}\]
Proportion of the individuals that become affected by disease X who die during a time period t.
Measures disease severity
\[\frac{deaths}{cases}\]
Fraction of all the deaths caused by disease X
\[\frac{deaths\:from\:disease\:X}{all\:deaths}\]
overall population
adjusted rates controlling for confounding factors to remove the effect of that factor
Apply the specific subgroup rates of each population to a standard population and calculate the rate on the standard population.
direct adjusted example from Gordis
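Direct adjustment can be sketched as follows, assuming hypothetical age-specific death rates for two populations and a hypothetical standard population. None of these numbers come from Gordis.

```r
# Hypothetical age-specific death rates (deaths per person per year)
rate_pop_a <- c(young = 0.001, middle = 0.005, old = 0.020)
rate_pop_b <- c(young = 0.002, middle = 0.006, old = 0.015)

# Hypothetical standard population, same age groups
standard_pop <- c(young = 50000, middle = 30000, old = 20000)

# Deaths expected if the standard population experienced each set of rates
expected_a <- sum(rate_pop_a * standard_pop)
expected_b <- sum(rate_pop_b * standard_pop)

# Directly adjusted rates: both now refer to the same age structure
adjusted_rate_a <- expected_a / sum(standard_pop)
adjusted_rate_b <- expected_b / sum(standard_pop)
```

Because both adjusted rates use the same age distribution, any remaining difference between them is not due to age structure.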
Compare populations: subgroup vs general
SMR = Standardized Mortality Ratio \[SMR= \frac{Observed}{Expected}\]
Expected: apply the general population rates to each specific subgroup and add up all the expected cases. Observed: add up all the observed cases in each specific subgroup.
indirect adjusted example from Gordis
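Indirect adjustment and the SMR can be sketched in the same way, assuming a hypothetical study group split into age strata and hypothetical general-population rates. The numbers are illustrative only.

```r
# Hypothetical study group (e.g. an occupational cohort), by age stratum
group_pop       <- c(young = 2000, middle = 1500, old = 500)
observed_deaths <- c(young = 4,    middle = 10,   old = 12)

# Hypothetical death rates in the general population, same strata
general_rates <- c(young = 0.001, middle = 0.005, old = 0.020)

# Expected: general-population rates applied to each stratum, then summed
expected <- sum(general_rates * group_pop)
# Observed: deaths actually seen in the group, summed over strata
observed <- sum(observed_deaths)

smr <- observed / expected
```

An SMR above 1 means more deaths were observed than would be expected if the group had the general population's stratum-specific rates.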
Measures of effect compare an exposed population to its counterfactual unexposed population, that is, the exact same population at the same time point had it not been exposed. That is, the effect of \(E^+\) on the probability of being \(D^+\) in the SAME population.
Measures of association compare one exposed population to another unexposed population (a different population or the same population at a different time point) assuming that both populations are comparable. That is the effect of \(E^+\) on the probability of being \(D^+\) between \(E^-\) and \(E^+\).
causal types
Doomed = always has disease, exposed or not
Susceptible = has disease when exposed, but not when unexposed
Protected = does not have disease when exposed, but has disease when unexposed
Immune = never has disease, exposed or not
\(P(D^+|E^+) = p_1 + p_2 = doomed + susceptible\)
\(P(D^-|E^+) = p_3 + p_4 = protected + immune\)
\(P(D^+|E^-) = p_1 + p_3 = doomed + protected\)
\(P(D^-|E^-) = p_2 + p_4 = susceptible + immune\)
| | Disease + | Disease - | Total |
|---|---|---|---|
| Exposed + | a = E(+) & D(+) | b = E(+) & D(-) | a+b = E(+) |
| Exposed - | c = E(-) & D(+) | d = E(-) & D(-) | c+d = E(-) |
| Total | a+c = D(+) | b+d = D(-) | a+b+c+d = total |
measures of disease effect or association
Conditional probability refresher = \(P(A|B)=\frac{P(A \cap B)}{P(B)}\)
Measures magnitude of risk. Does not take into account the unexposed population or whether risk is associated with exposure. \[P(D^+|E^+)=\frac{a}{a+b}=\frac{new\:cases\:observed\:over\:period\:t}{total\:population\:at\:risk\:during\:period\:t}\]
Measures the strength of the association and possible causal relationship
RR = 1 \(\to\) no effect
\[\frac{P(D^+|E^+)}{P(D^+|E^-)}=\frac{\frac{a}{a+b}}{\frac{c}{c+d}}=\frac{incidence\:in\:exposed\:population}{incidence\:in\:unexposed\:population}\] It can be expressed as:
\[\frac{P(D^+|E^+)}{P(D^+|E^-)}=\frac{\frac{a}{a+b}}{\frac{c}{c+d}}=\frac{proportion\:D^+\:in\:exposed\:population}{proportion\:D^+\:in\:unexposed\:population}\]
\[\frac{D^+\:in\:person-time\:of\:exposed\:population}{D^+\:in\:person-time\:of\:unexposed\:population}=\frac{incidence\:in\:exposed\:population}{incidence\:in\:unexposed\:population}\]
Measures the strength of the association but cannot suggest a causal relationship
\[\frac{\frac{P(D^+|E^+)}{P(D^-|E^+)}}{\frac{P(D^+|E^-)}{P(D^-|E^-)}}=\frac{\frac{a}{b}}{\frac{c}{d}}=\frac{ad}{bc}=\frac{odds\:D^+\:in\:exposed\:population}{odds\:D^+\:in\:unexposed\:population}\] or
\[\frac{\frac{P(E^+|D^+)}{P(E^-|D^+)}}{\frac{P(E^+|D^-)}{P(E^-|D^-)}}=\frac{\frac{a}{c}}{\frac{b}{d}}=\frac{ad}{bc}=\frac{odds\:E^+\:in\:diseased\:population}{odds\:E^+\:in\:not\:diseased\:population}\]
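A worked sketch of the relative risk and odds ratio formulas, using the a, b, c, d layout of the 2x2 table. The counts are hypothetical.

```r
# Hypothetical 2x2 table: rows = exposure, columns = disease
tab <- matrix(
  c(30, 70,    # a, b: exposed with / without disease
    10, 90),   # c, d: unexposed with / without disease
  nrow = 2, byrow = TRUE,
  dimnames = list(exposure = c("E+", "E-"), disease = c("D+", "D-"))
)

risk_exposed   <- tab["E+", "D+"] / sum(tab["E+", ])   # a / (a + b)
risk_unexposed <- tab["E-", "D+"] / sum(tab["E-", ])   # c / (c + d)
relative_risk  <- risk_exposed / risk_unexposed

# Odds ratio as the cross-product ad / bc
odds_ratio <- (tab["E+", "D+"] * tab["E-", "D-"]) /
              (tab["E+", "D-"] * tab["E-", "D+"])
```

In this made-up table the disease is not rare, so the odds ratio comes out further from 1 than the relative risk.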
A: cohort study, B: case-control study, Gordis Figure 11-5
Gordis Figure 11-9
Odds Ratio (including matched-pairs odds ratio, and the “rare disease assumption”) Attributable Risk Etiologic Fraction Population Attributable Risk Risk/probability Rate Ratio
Precision Validity
Necessary, Sufficient Koch-Henle Criteria Bradford Hill Criteria
Selection Information/misclassification (differential/non-differential) Confounding Methods for identifying/detecting confounding Methods for controlling confounding
Additive Multiplicative Absolute vs. Relative Measures of Effect
chance or random variation that remains unexplained
This is the degree to which a sample population deviates from the total population. It’s unpredictable and due to the sampling process.
A sample is a subset of the subjects in the population that could have been included in the study, or a subset of the experiences the study subjects could have had.
ASSUMPTIONS OF SAMPLING
Randomness assumption: the sample is a random selection of the subjects in the population that could have been included in the study
Representativeness assumption: the sample is representative of the subjects in the population that could have been included in the study
insert irva/Causal Inference: What If by Miguel A. Hernán, James M. Robins here
“The questions that motivate most studies in the health, social and behavioral sciences are not associational but causal in nature. For example, what is the efficacy of a given drug in a given population? What was the cause of death of a given individual, in a specific incident? These are causal questions because they require some knowledge of the data-generating process; they cannot be computed from the data alone, nor from the distributions that govern the data” (Pearl 2009).
"The aim of standard statistical analysis is to assess parameters of a distribution from samples drawn of that distribution. With the help of such parameters, associations among variables can be inferred, which permits the researcher to estimate probabilities of past and future events and update those probabilities in light of new information. These tasks are managed well by standard statistical analysis so long as experimental conditions remain the same. Causal analysis goes one step further; its aim is to infer probabilities under conditions that are changing, for example, changes induced by treatments or external interventions.
This distinction implies that causal and associational concepts do not mix; there is nothing in a distribution function to tell us how that distribution would differ if external conditions were to change—say from observational to experimental setup—because the laws of probability theory do not dictate how one property of a distribution ought to change when another property is modified. This information must be provided by causal assumptions which identify relationships that remain invariant when external conditions change.
Causal relations cannot be expressed in the language of probability and, hence, that any mathematical approach to causal analysis must acquire new notation – probability calculus is insufficient. To illustrate, the syntax of probability calculus does not permit us to express the simple fact that “symptoms do not cause diseases,” let alone draw mathematical conclusions from such facts. All we can say is that two events are dependent—meaning that if we find one, we can expect to encounter the other, but we cannot distinguish statistical dependence, quantified by the conditional probability P(disease|symptom) from causal dependence, for which we have no expression in standard probability calculus." (Pearl 2010)
The difference between association and causality is that causality is directional, and this direction cannot be represented with standard probability calculus notation.
A statistical association between an exposure and an outcome can be due to either or both a :
A third effect between 2 variables can be
We will look into causal inference using a working example from Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second edition by Richard McElreath: Correlation between marriage rate (the exposure) and divorce rate (the outcome).
There are three observed variables in play: divorce rate (D), marriage rate (M), and the median age at marriage (A) in each State of the U.S. Both marriage rates and median age at marriage are great predictors of the divorce rate in a given State, but are these relationships causal?
The rate at which adults marry is a great predictor of divorce rate. But does marriage cause divorce? In a trivial sense it obviously does: One cannot get a divorce without first getting married. But there’s no reason high marriage rate must cause more divorce. It’s easy to imagine high marriage rate indicating high cultural valuation of marriage and therefore being associated with low divorce rate.
Age at marriage is also a good predictor of divorce rate— higher age at marriage predicts less divorce. But there is no reason this has to be causal, either, unless age at marriage is very late and the spouses do not live long enough to get a divorce.
# load data and copy
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce
# standardize variables
d$D <- standardize( d$Divorce )
d$M <- standardize( d$Marriage )
d$A <- standardize( d$MedianAgeMarriage )
\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)
\(\mu_{i} = \alpha + \beta_{A}A_{i}\)
Because the outcome and the predictor are both standardized, the intercept α should end up very close to zero.
What does the prior slope \(\beta_{A}\) imply? If \(\beta_{A}\) = 1, that would imply that a change of one standard deviation in age at marriage is associated likewise with a change of one standard deviation in divorce. To know whether or not that is a strong relationship, you need to know how big a standard deviation of age at marriage is:
sd( d$MedianAgeMarriage )
## [1] 1.24363
So when \(\beta_{A}\) = 1, a change of 1.2 years in median age at marriage is associated with a full standard deviation change in the outcome variable. That seems like an insanely strong relationship.
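A quick check of how much prior probability a Normal(0, 0.5) prior on the slope places on such extreme values, assuming that prior (the one used for bA in the model that follows). The variable names are illustrative.

```r
prior_sd <- 0.5

# Prior probability that the slope is more extreme than 1 in either direction
prob_extreme <- 2 * pnorm(-1, mean = 0, sd = prior_sd)
prob_extreme
```

A slope of 1 sits two prior standard deviations from zero, so only a small share of the prior mass lies beyond it.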
m5.1 <- quap(
alist(
D ~ dnorm( mu , sigma ) ,
mu <- a + bA * A ,
a ~ dnorm( 0 , 0.2 ) ,
#when βA = 1, a change of 1.2 years in median age at marriage is associated with a full standard deviation change in the outcome variable (divorce)
#only 5% of plausible slopes more extreme than 1.
bA ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
) , data = d )
precis(m5.1)
## mean sd 5.5% 94.5%
## a -4.529576e-06 0.09737576 -0.1556298 0.1556207
## bA -5.684081e-01 0.10999552 -0.7442021 -0.3926140
## sigma 7.882938e-01 0.07800347 0.6636292 0.9129584
The posterior for \(\beta_{A}\) is reliably negative, as seen:
# compute percentile interval of mean
A_seq <- seq( from=-3 , to=3.2 , length.out=30 )
mu <- link( m5.1 , data=list(A=A_seq) )
mu.mean <- apply( mu , 2, mean )
mu.PI <- apply( mu , 2 , PI )
# plot it all
plot( D ~ A , data=d , col=rangi2 )
lines( A_seq , mu.mean , lwd=2 )
shade( mu.PI , A_seq )
\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)
\(\mu_{i} = \alpha + \beta_{M}M_{i}\)
m5.2 <- quap(
alist(
D ~ dnorm( mu , sigma ) ,
mu <- a + bM * M ,
a ~ dnorm( 0 , 0.2 ) ,
bM ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
) , data = d )
precis(m5.2)
## mean sd 5.5% 94.5%
## a -0.0000374805 0.10824756 -0.1730380 0.1729630
## bM 0.3500545490 0.12592919 0.1487954 0.5513137
## sigma 0.9102788527 0.08986571 0.7666561 1.0539016
# compute percentile interval of mean
M_seq <- seq( from=-3 , to=3.2 , length.out=30 )
mu <- link( m5.2 , data=list(M=M_seq) )
mu.mean <- apply( mu , 2, mean )
mu.PI <- apply( mu , 2 , PI )
# plot it all
plot( D ~ M , data=d , col=rangi2 )
lines( M_seq , mu.mean , lwd=2 )
shade( mu.PI , M_seq )
This relationship isn’t as strong as the previous one.
The pattern we see in the previous two models is symptomatic of a situation in which only one of the predictor variables, A in this case, has a causal impact on the outcome, D, even though both predictor variables are strongly associated with the outcome.
The total causal effect is the sum of the direct and indirect effects
Example: age of marriage influences divorce in two ways.
Example: a direct effect would arise because younger people change faster than older people and are therefore more likely to grow incompatible with a partner.
Example: age of marriage has an indirect effect by influencing the marriage rate, which then influences divorce. If people get married earlier, then the marriage rate may rise, because there are more young people. Consider for example if an evil dictator forced everyone to marry at age 65. Since a smaller fraction of the population lives to 65 than to 25, forcing delayed marriage will also reduce the marriage rate. If marriage rate itself has any direct effect on divorce, maybe by making marriage more or less normative, then some of that direct effect could be the indirect effect of age at marriage.
The exposed and the unexposed in the study are not comparable, or exchangeable, which is the ultimate source of the bias (the unexposed group is not the counterfactual of the exposed group).
There is confounding when the association between exposure and outcome includes a noncausal component attributable to their having an uncontrolled common cause.
There is selection bias when the association between exposure and outcome includes a noncausal component attributable to restricting the analysis to certain level(s) of a common effect of exposure and outcome or, more generally, to conditioning on a common effect of variables correlated with exposure and outcome.
Confounder
a variable that is:
automatic variable selection procedures: e.g. stepwise regression. These assume that all important confounders will be selected.
change in effect estimate: comparison of the effect estimates between adjusted and unadjusted effect estimates. The variable is selected as a confounder if there is a relative change greater than 10%. It assumes that any variable substantially associated with an estimate change is worth adjusting for.
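A minimal sketch of the change-in-estimate criterion, assuming simulated data, ordinary linear regression with lm(), and the 10% threshold described above. The variable names and effect sizes are hypothetical.

```r
set.seed(42)
n <- 1000
conf <- rnorm(n)                       # candidate confounder
expo <- conf + rnorm(n)                # exposure depends on the confounder
outc <- 0.5 * expo + conf + rnorm(n)   # true exposure effect is 0.5

crude    <- coef(lm(outc ~ expo))["expo"]          # unadjusted estimate
adjusted <- coef(lm(outc ~ expo + conf))["expo"]   # adjusted for conf

# Relative change in the exposure estimate after adjustment
rel_change <- abs(crude - adjusted) / abs(adjusted)
rel_change > 0.10   # TRUE flags `conf` as a confounder under the 10% rule
```

The rule reacts to any variable that shifts the estimate, including a collider or a mediator, so it cannot replace causal reasoning about the variable's role.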
Statistical criteria alone are insufficient to characterize either confounding or selection bias.
The presence of common causes, and therefore of confounding, can be represented by causal diagrams known as directed acyclic graphs (DAGs).
diagrams that link variables by arrows that represent direct causal effects (protective or causative) of one variable on another.
There are only four types of variable relations that combine to form all possible paths:
the CONFOUNDER = fork: X ← Z → Y. This is the classic confounder: some variable Z is a common cause of X and Y, generating a correlation between them. If we condition on Z, then learning X tells us nothing about Y. X and Y are independent, conditional on Z.
the PIPE = intermediary: X → Z → Y. The treatment X influences Z which influences Y. If we condition on Z, we block the path from X to Y. X and Y are independent, conditional on Z.
the COLLIDER = common effect: X → Z ← Y. Conditioning on Z, the collider variable, opens the path. X and Y are dependent, conditional on Z, however neither X nor Y has any causal influence on the other.
the DESCENDANT: Z \(\to\) D. A descendant is a variable influenced by another variable. Conditioning on a descendant partly conditions on its parent: conditioning on D will also condition, to a lesser extent, on Z, because D carries some information about Z.
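The effect of conditioning in the fork, pipe and collider can be sketched with a small simulation, assuming linear relations with unit coefficients and standard normal noise, and using lm() as a stand-in for "conditioning on Z". The helper slope_x() and all data are hypothetical.

```r
set.seed(1)
n <- 5000

# Helper: slope of X in a linear regression
slope_x <- function(formula, data) unname(coef(lm(formula, data = data))["X"])

# Fork: X <- Z -> Y
Z <- rnorm(n)
fork <- data.frame(Z = Z, X = Z + rnorm(n), Y = Z + rnorm(n))

# Pipe: X -> Z -> Y
X <- rnorm(n); Z <- X + rnorm(n)
pipe <- data.frame(X = X, Z = Z, Y = Z + rnorm(n))

# Collider: X -> Z <- Y (X and Y are generated independently)
X <- rnorm(n); Y <- rnorm(n)
collider <- data.frame(X = X, Y = Y, Z = X + Y + rnorm(n))

results <- data.frame(
  structure      = c("fork", "pipe", "collider"),
  unadjusted     = c(slope_x(Y ~ X, fork),
                     slope_x(Y ~ X, pipe),
                     slope_x(Y ~ X, collider)),
  adjusted_for_Z = c(slope_x(Y ~ X + Z, fork),
                     slope_x(Y ~ X + Z, pipe),
                     slope_x(Y ~ X + Z, collider))
)
results
```

In the fork and the pipe the X slope should shrink towards zero once Z is included, while in the collider it should move away from zero only after Z is included.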
Path : any series of variables you could walk through to get from one variable to another, ignoring the directions of the arrows.
Blocking all confounding paths between some predictor X and some outcome Y is known as shutting the backdoor, thus eliminating spurious associations that are non-causal.
Example:
There are two paths connecting E and O: (1) E → O (2) E ← C → O.
Both of these paths create a statistical association between E and O. But only the first path is causal. The second path is non-causal. If only the second path existed, and we changed E, it would not change O. Any causal influence of E on O operates only on the first path.
Manipulation removes the influence of C on E: when we determine E, the C variable does not influence E, thus blocking the non-causal path between E and O (E ← C → O). Once the path is blocked, there is only one way for information to go between E and O, and then measuring the association between E and O would yield a useful measure of causal influence.
Adding C to the model blocks the non-causal path E ← C → O.
Why? Think of this path in isolation, as a complete model.
Once you learn C, also learning E will give you no additional information about O.
Example: Suppose for example that C is the average wealth in a region. Regions with high wealth have better schools, resulting in more education (exposure E), as well as better paying jobs, resulting in higher wages (outcome O). If you don’t know the region a person lives in, learning the person’s education E will provide information about their wages O, because E and O are correlated across regions. But after you learn which region a person lives in, assuming there is no other path between E and O, then learning E tells you nothing more about O. This is the sense in which conditioning on C blocks the path—it makes E and O independent, conditional on C.
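The two paths and the effect of conditioning on C can also be checked with dagitty. This is a sketch assuming the three-node graph described in this example; the object name dag_eco is hypothetical.

```r
library(dagitty)

# E -> O is the causal path; E <- C -> O is the backdoor path
dag_eco <- dagitty("dag {
  E -> O
  E <- C -> O
}")

# All paths between E and O, and whether each is open
all_paths <- paths(dag_eco, from = "E", to = "O")

# The same paths after conditioning on C
paths_given_c <- paths(dag_eco, from = "E", to = "O", Z = "C")

# Variables to adjust for to estimate the effect of E on O
adj_sets <- adjustmentSets(dag_eco, exposure = "E", outcome = "O")
```

Comparing the `open` element of `all_paths` and `paths_given_c` shows which path is closed by conditioning on C.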
Obtaining an unbiased estimate of the total causal effect requires measuring and adjusting for all confounders of the E \(\to\) O association
Obtaining an unbiased estimate of the direct causal effect requires measuring and adjusting for all confounders of both the
Do-operator
do(E) closes the backdoor paths into E, as in a manipulative experiment.
P(O|do(E)) defines a causal relationship because it tells us the expected result of manipulating E on O
* Confounding: P(O|E) \(\ne\) P(O|do(E)). The association between E and O observed in the data differs from the one obtained when the backdoor paths into E are closed, which indicates confounding.
* Conditional probability, non-causal: comparing P(O|E) with P(O|not-E) doesn’t close the backdoor, and therefore does not by itself give a causal relationship.
* Total causal relationship: if P(O|do(E)) \(\ne\) P(O|do(not-E)), then E is a cause of O.
* Direct causal relationship: might require closing more backdoor paths.
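The gap between P(O|E) and P(O|do(E)) can be illustrated with a simulation, assuming binary variables and a single common cause C. All probabilities below are hypothetical; the true effect of E on the probability of O is set to 0.2 by construction.

```r
set.seed(7)
n <- 1e5
C <- rbinom(n, 1, 0.5)   # common cause of E and O

# Observational world: C influences who gets exposed
E_obs <- rbinom(n, 1, 0.2 + 0.6 * C)
O_obs <- rbinom(n, 1, 0.1 + 0.2 * E_obs + 0.5 * C)
obs_diff <- mean(O_obs[E_obs == 1]) - mean(O_obs[E_obs == 0])

# Interventional world: E is set by us, which removes the arrow C -> E
O_do1 <- rbinom(n, 1, 0.1 + 0.2 * 1 + 0.5 * C)   # do(E = 1)
O_do0 <- rbinom(n, 1, 0.1 + 0.2 * 0 + 0.5 * C)   # do(E = 0)
do_diff <- mean(O_do1) - mean(O_do0)

c(observational = obs_diff, interventional = do_diff)
```

The observational contrast mixes the effect of E with the effect of C, so it should come out larger than the interventional contrast, which stays close to 0.2.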
To obtain the total causal effect we condition on C but not J:
dag1.4 %>%
ggdag_dseparated(from = "E", to = "O", controlling_for = "C")+
theme_dag()
dag1.4 %>%
ggdag_dseparated(from = "E", to = "O", controlling_for = "J")+
theme_dag()
dag1.4 %>%
ggdag_dseparated(from = "E", to = "O", controlling_for = c("J", "C"))+
theme_dag()
dag1.5 %>%
ggdag_dseparated(from = "E", to = "O", controlling_for = "D")+
theme_dag()
dag1.5 %>%
ggdag_dseparated(from = "E", to = "O", controlling_for = c("C", "D", "J"))+
theme_dag()
To infer the strength of these different arrows, we need more than one statistical model.
The total causal effect is the sum of the direct and indirect effects
Model m5.1, the regression of D on A, tells us only that the total influence of age at marriage on divorce rate is strongly negative. The “total” here means we have to account for every path from A to D. There are two such paths in this graph: A → D, a direct path, and A → M → D, an indirect path.
\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)
\(\mu_{i} = \alpha + \beta_{A}A_{i}\)
m5.1 <- quap(
alist(
D ~ dnorm( mu , sigma ) ,
mu <- a + bA * A ,
a ~ dnorm( 0 , 0.2 ) ,
#when βA = 1, a change of 1.2 years in median age at marriage is associated with a full standard deviation change in the outcome variable (divorce)
#only 5% of plausible slopes more extreme than 1.
bA ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
) , data = d )
precis(m5.1)
## mean sd 5.5% 94.5%
## a -2.460763e-05 0.09737726 -0.1556523 0.1556031
## bA -5.684075e-01 0.10999764 -0.7442050 -0.3926100
## sigma 7.883097e-01 0.07800738 0.6636388 0.9129805
dag1.3 %>%
ggdag_dseparated(from = "A", to = "D", controlling_for = "A")+
theme_dag()
In general, it is possible that a variable like A has no direct effect at all on an outcome like D. It could still be associated with D entirely through the indirect path. That type of relationship is known as mediation.
As you’ll see however, the indirect path does almost no work in this case. How can we show that?
We know from m5.2 that marriage rate is positively associated with divorce rate. But that isn’t enough to tell us that the path M → D is positive. It could be that the association between M and D arises entirely from A’s influence on both M and D. Like this:
This DAG is also consistent with the posterior distributions of models m5.1 and m5.2. Why? Because both M and D “listen” to A. They have information from A. So when you inspect the association between D and M, you pick up that common information that they both got from listening to A.
So which is it? Is there a direct effect of marriage rate, or rather is age at marriage just driving both, creating a spurious correlation between marriage rate and divorce rate? To find out, we need to consider carefully what each DAG implies.
Testable implications can be read off the diagrams using a graphical criterion known as d-separation (Pearl, 1988). Each diagram encodes causal assumptions, each corresponding to a missing arrow or a missing double-arrow between a pair of variables.
DAGs imply that some variables are independent of others under certain conditions, therefore the testable implications of a DAG are its CONDITIONAL INDEPENDENCIES.
CONDITIONAL INDEPENDENCIES describe which variables should be associated with one another (or not) in the data, and which variables become disassociated when we condition on some other set of variables.
Conditional independencies are pairs of variables that are not associated, once we condition on some set of other variables.
Conditioning: conditioning on a variable Z means learning its value and then asking if X adds any additional information about Y. If learning X doesn’t give you any more information about Y, then we might say that Y is independent of X conditional on Z. This conditioning statement is sometimes written as: \(Y \!\perp\!\!\!\perp X|Z\)
\(X \not\!\perp\!\!\!\perp Y\) means “not independent of”
\(X \!\perp\!\!\!\perp Y\) means “independent of”
In our divorce example
If we look in the data and find that any pair of variables are not associated, then something is wrong with the DAG (assuming the data are correct). In these data, all three pairs are in fact strongly associated. Check for yourself. You can use cor to measure simple correlations. Correlations are sometimes terrible measures of association—many different patterns of association with different implications can produce the same correlation. But they do honest work in this case.
cor(d$D, d$M)
## [1] 0.3737314
cor(d$D, d$A)
## [1] -0.5972392
cor(d$M, d$A)
## [1] -0.721096
ggdag(dag1.3) +
theme_dag()
This DAG says:
There are 3 causal assumptions that can be tested (one for every arrow).
Before we condition on anything, we assume everything is associated with everything else.
The testable implications are:
Implied conditional independencies: none
DMA_dag1 <- dagitty('dag{ D <- A -> M -> D }')
impliedConditionalIndependencies( DMA_dag1 )
ggdag(dag1.2) +
theme_dag()
In this DAG, it is still true that all three variables are associated with one another. A is associated with D and M because it influences them both. And D and M are associated with one another, because A influences them both. They share a cause, and this leads them to be correlated with one another through that cause. There are 3 causal assumptions that can be tested (one for every arrow). Before we condition on anything, we assume everything is associated with everything else.
(2) M causes D
But suppose we condition on A. All of the information in M that is relevant to predicting D is in A. So once we’ve conditioned on A, M tells us nothing more about D. So in the second DAG, a testable implication is that D is independent of M, conditional on A. In other words, \(D \!\perp\!\!\!\perp M|A\)
The testable implications are:
All 3 variables should be associated, before conditioning on anything:
\(D \not\!\perp\!\!\!\perp A\) D not independent of A
\(D \not\!\perp\!\!\!\perp M\) D not independent of M
\(A \not\!\perp\!\!\!\perp M\) A not independent of M
\(D \!\perp\!\!\!\perp M|A\) D and M should be independent after conditioning on A.
DMA_dag2 <- dagitty('dag{ D <- A -> M }')
impliedConditionalIndependencies( DMA_dag2 )
## D _||_ M | A
The only implication that differs between these DAGs is the last one: \(D \!\perp\!\!\!\perp M|A\), that is, D and M should be independent after conditioning on A.
To test this implication, we need a statistical model that conditions on A, so we can see whether that renders D independent of M. And that is what multiple regression helps with. It can address a useful descriptive question: Is there any additional value in knowing a variable, once I already know all of the other predictor variables?
So for example once you fit a multiple regression to predict divorce using both marriage rate and age at marriage, the model addresses the questions: (1) After I already know marriage rate, what additional value is there in also knowing age at marriage? (2) After I already know age at marriage, what additional value is there in also knowing marriage rate?
The parameter estimates corresponding to each predictor are the (often opaque) answers to these questions. The questions above are descriptive, and the answers are also descriptive. It is only the derivation of the testable implications above that give these descriptive results a causal meaning. But that meaning is still dependent upon believing the DAG.
For each predictor, the parameter measures its conditional association with the outcome.
\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)
\(\mu_{i} = \alpha + \beta_{M}M_{i} + \beta_{A}A_{i}\)
m5.3 <- quap(
alist(
D ~ dnorm( mu , sigma ) ,
mu <- a + bM*M + bA*A ,
a ~ dnorm( 0 , 0.2 ) ,
bM ~ dnorm( 0 , 0.5 ) ,
bA ~ dnorm( 0 , 0.5 ) ,
sigma ~ dexp( 1 )
) , data = d )
precis( m5.3 )
## mean sd 5.5% 94.5%
## a -1.300476e-05 0.09707549 -0.1551584 0.1551324
## bM -6.538231e-02 0.15077212 -0.3063453 0.1755807
## bA -6.135495e-01 0.15098214 -0.8548481 -0.3722509
## sigma 7.851123e-01 0.07784195 0.6607058 0.9095188
dag1.3 %>%
ggdag_dseparated(from = "A", to = "D", controlling_for = c("A", "M"))+
theme_dag()
The posterior mean for marriage rate, bM, is now close to zero, with plenty of probability on both sides of zero. The posterior mean for age at marriage, bA, is essentially unchanged. It will help to visualize the posterior distributions for all three models, focusing just on the slope parameters βA and βM:
plot(coeftab(m5.1,m5.2,m5.3), par=c("bA","bM"))
bA doesn’t move, only grows a bit more uncertain, while bM is only associated with divorce when age at marriage is missing from the model. You can interpret these distributions as saying: Once we know median age at marriage for a State, there is little or no additional predictive power in also knowing the rate of marriage in that State, which means \(D \!\perp\!\!\!\perp M|A\). D and M are independent after conditioning on A, which corresponds to the second DAG.
Note that this does not mean that there is no value in knowing marriage rate. Consistent with the earlier DAG, if you didn’t have access to age-at-marriage data, then you’d definitely find value in knowing the marriage rate. M is predictive but not causal. Assuming there are no other causal variables missing from the model, this implies there is no important direct causal path from marriage rate to divorce rate. The association between marriage rate and divorce rate is spurious, caused by the influence of age of marriage on both marriage rate and divorce rate.
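A simulation sketch of this spurious association, assuming the DAG D ← A → M with hypothetical effect sizes and using lm() in place of quap() to keep the example dependency-free. None of the numbers are estimates from the WaffleDivorce data.

```r
set.seed(5)
n <- 5000
A <- rnorm(n)                    # standardized age at marriage
M <- rnorm(n, mean = -0.7 * A)   # marriage rate depends only on A
D <- rnorm(n, mean = -0.6 * A)   # divorce rate depends only on A

# M alone looks predictive of D ...
bM_alone <- coef(lm(D ~ M))["M"]

# ... but its slope collapses once A is in the model
fit_both <- coef(lm(D ~ M + A))

round(c(bM_alone = unname(bM_alone),
        bM_given_A = unname(fit_both["M"]),
        bA_given_M = unname(fit_both["A"])), 2)
```

Here M has no arrow into D by construction, so any association between them before adjustment comes entirely from A.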
We’re interested in the total causal effect of the number of Waffle Houses on divorce rate in each State. Presumably, the naive correlation between these two variables is spurious. What is the minimal adjustment set that will block backdoor paths from Waffle House to divorce?
Let’s make a graph:
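A sketch of one graph consistent with the adjustment sets and conditional independencies listed in this section, assuming the Waffle House DAG from Statistical Rethinking, where S is whether a State is in the South and W is the number of Waffle Houses. The object name dag6 matches the call to impliedConditionalIndependencies() further down; the exact arrows are an assumption.

```r
library(dagitty)

# Assumed DAG: S influences A, M and W; A, M and W each influence D
dag6 <- dagitty("dag {
  A -> D
  A -> M -> D
  A <- S -> M
  S -> W -> D
}")

# Sets of variables that block all backdoor paths from W to D
adj_sets <- adjustmentSets(dag6, exposure = "W", outcome = "D")
adj_sets
```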
## { A, M }
## { S }
We could control for either A and M or for S alone. This DAG is obviously not satisfactory—it assumes there are no unobserved confounds, which is very unlikely for this sort of data. But we can still learn something by analyzing it. While the data cannot tell us whether a graph is correct, it can sometimes suggest how a graph is wrong.
Inspecting implied conditional independencies, we can at least test some of the features of a graph.
impliedConditionalIndependencies( dag6 )
## A _||_ W | S
## D _||_ S | A, M, W
## M _||_ W | S
The median age of marriage should be independent of (_||_) Waffle Houses, conditioning on (|) a State being in the south.
Divorce and being in the south should be independent when we simultaneously condition on all of median age of marriage, marriage rate, and Waffle Houses.
Marriage rate and Waffle Houses should be independent, conditioning on being in the south.
include: colliders, multicollinearity, post-treatment bias. end of chapter 6
INTERACTIONS chapter 8
Testing vs estimation:
Hypothesis testing: the P-value has several limitations:
* it gives no information about the direction of the association
* it gives no information about the magnitude of the effect (it mixes precision with magnitude)
* it depends on sample size
* its clinical relevance is not clear
Estimation: effect size (OR, RR) and precision (95% CI) are separated.
Predictive vs estimation models, by goal:
Prediction: the goal is to determine the combination of factors that provides the best prediction of an outcome. Variable selection is determined by strength of association.
Estimation of association: the goal is to determine the unbiased association of one factor with another. Variable selection is determined by the change in the estimate of association.
CONFOUNDING ≠ effect modifier ≠ interaction